“Machines can never think as humans do, but just because something thinks differently from you, does that mean it’s not thinking?” (The Imitation Game)
This kernel uses the TMDB 5000 Movie Dataset from TMDb (The Movie Database) to build a content-based recommendation engine using plots and metadata. Modern recommendation systems rely on users' votes and likes, but for a brand-new user a content-based approach can come in handy.
The dataset is nicely organized and pretty clean. It consists of two files: one with movie information like budget, revenue, user votes, etc., and a second that lists cast and crew; the two can be joined on the movie_id column. First I do some exploration to find cool facts about top actors and directors, with some visualizations. Then I go through three approaches to building a recommendation system. Finally (in progress) I try to use an LDA (Latent Dirichlet Allocation) topic model as a dimensionality-reduction technique to see if I can get results similar to the previous approaches.
I rely heavily on the tidyverse, with tm and tidytext for text manipulation and topicmodels for LDA. Also, because it fits this kernel perfectly, all visualizations use wesanderson color palettes.
library(jsonlite)
library(tidyverse)
library(magrittr)
library(tm)
library(tidytext)
library(topicmodels)
library(wesanderson)
library(GGally)
movies_db <- read_csv("tmdb-5000-movie-dataset/tmdb_5000_movies.csv")
credits_db <- read_csv("tmdb-5000-movie-dataset/tmdb_5000_credits.csv")
glimpse(movies_db)
## Observations: 4,803
## Variables: 20
## $ budget <int> 237000000, 300000000, 245000000, 25000000...
## $ genres <chr> "[{\"id\": 28, \"name\": \"Action\"}, {\"...
## $ homepage <chr> "http://www.avatarmovie.com/", "http://di...
## $ id <int> 19995, 285, 206647, 49026, 49529, 559, 38...
## $ keywords <chr> "[{\"id\": 1463, \"name\": \"culture clas...
## $ original_language <chr> "en", "en", "en", "en", "en", "en", "en",...
## $ original_title <chr> "Avatar", "Pirates of the Caribbean: At W...
## $ overview <chr> "In the 22nd century, a paraplegic Marine...
## $ popularity <dbl> 150.43758, 139.08262, 107.37679, 112.3129...
## $ production_companies <chr> "[{\"name\": \"Ingenious Film Partners\",...
## $ production_countries <chr> "[{\"iso_3166_1\": \"US\", \"name\": \"Un...
## $ release_date <date> 2009-12-10, 2007-05-19, 2015-10-26, 2012...
## $ revenue <dbl> 2787965087, 961000000, 880674609, 1084939...
## $ runtime <int> 162, 169, 148, 165, 132, 139, 100, 141, 1...
## $ spoken_languages <chr> "[{\"iso_639_1\": \"en\", \"name\": \"Eng...
## $ status <chr> "Released", "Released", "Released", "Rele...
## $ tagline <chr> "Enter the World of Pandora.", "At the en...
## $ title <chr> "Avatar", "Pirates of the Caribbean: At W...
## $ vote_average <dbl> 7.2, 6.9, 6.3, 7.6, 6.1, 5.9, 7.4, 7.3, 7...
## $ vote_count <int> 11800, 4500, 4466, 9106, 2124, 3576, 3330...
We can see some columns in JSON format, like genres and production_companies; the same happens with crew and cast in credits_db. Every cell of these columns holds several values, e.g. genres can have two values for one movie but four for another. Let's create some data frames out of those columns, keeping movie_id to join on later. The code for this task comes from this kernel
genres <- movies_db %>%
filter(nchar(genres) > 2) %>%
mutate(aux_json = lapply(genres, fromJSON)) %>%
unnest(aux_json) %>%
select(id, genres = name) %>%
mutate_if(is.character, factor)
prod_comp <- movies_db %>%
filter(nchar(production_companies) > 2) %>%
mutate(aux_json = lapply(production_companies, fromJSON)) %>%
unnest(aux_json) %>%
select(id, production_companies = name) %>%
mutate_if(is.character, factor)
cast <- credits_db %>%
filter(nchar(cast) > 2) %>%
mutate(aux_json = lapply(cast, fromJSON)) %>%
unnest(aux_json) %>%
filter(order < 5) %>%
select(movie_id, name)
director <- credits_db %>%
filter(nchar(crew) > 2) %>%
mutate(aux_json = lapply(crew, fromJSON)) %>%
unnest(aux_json) %>%
filter(department == 'Directing' & job == 'Director') %>%
select(movie_id, name)
head(genres)
| id | genres |
|---|---|
| 19995 | Action |
| 19995 | Adventure |
| 19995 | Fantasy |
| 19995 | Science Fiction |
| 285 | Adventure |
| 285 | Fantasy |
Let's use the ggpairs function from the GGally package to inspect the correlations and distributions of some variables
lowerFn <- function(data, mapping, method = "lm") {
ggplot(data = data, mapping = mapping) +
geom_point(colour = "slategray4", size = 2, shape = 18) +
geom_smooth(method = method, color = "coral3") +
theme_minimal() +
theme(axis.text.x = element_text(size = 8, angle = 45))
}
movies_db %>%
select(budget, revenue, release_date, vote_average, vote_count, runtime) %>%
na.omit() %>%
ggpairs(lower = list(continuous = wrap(lowerFn, method = "lm")),
diag = list(continuous = wrap("barDiag", fill = 'skyblue1', colour = "skyblue4")),
upper = list(continuous = wrap("cor", size = 5)),
progress = FALSE)
We see some skewed variables like budget and revenue, which also have the strongest positive correlation.
The dataset consists mainly of movies released after 1960, and there is a big jump in the number of movies released after the '90s.
Vote average is concentrated around 6, with a small spike at 0, probably movies with very few votes. Vote count is also skewed.
Runtime is concentrated around 100 minutes.
From the genres df, let's take the first genre of every movie, assuming it is the most representative one, and join with the director df to get the top directors by number of movies, filling each bar with the movies' genres
# Keep only the first genre
genre_1 <- genres %>% group_by(id) %>% summarise(genre = first(genres))
# Identify top directors by num of movies
top_directors <- director %>%
group_by(name) %>%
summarise(total_n = n()) %>%
top_n(20, wt = total_n)
director %>%
filter(name %in% top_directors$name) %>%
left_join(genre_1, by = c('movie_id' = 'id')) %>%
count(name, genre) %>%
left_join(top_directors, by = 'name') %>%
ggplot(aes(x = reorder(name, total_n), y = n, fill = genre)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = wes_palette('Darjeeling1', length(unique(genre_1$genre)), type = 'continuous')) +
labs(x = 'Directors', y = 'Number of Movies')
We see the genres that prevail among the top-20 directors are action, drama and comedy. Let's do the same with actors
# Identify top actors by number of movies
top_actors <- cast %>%
group_by(name) %>%
summarise(total_n = n()) %>%
top_n(20, wt = total_n)
cast %>%
filter(name %in% top_actors$name) %>%
left_join(genre_1, by = c('movie_id' = 'id')) %>%
count(name, genre) %>%
left_join(top_actors, by = 'name') %>%
ggplot(aes(x = reorder(name, total_n), y = n, fill = genre)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = wes_palette('Darjeeling1', length(unique(genre_1$genre)), type = 'continuous')) +
labs(x = 'Actors', y = 'Number of Movies')
Let's now look at director-actor collaborations together with their total revenues
# Join director, cast and revenue dfs
director %>%
left_join(cast, by = 'movie_id') %>%
left_join(movies_db %>% select(id, revenue), by = c('movie_id' = 'id')) %>%
transmute(director = name.x, actor = name.y, revenue,
collaboration = paste(director, actor, sep = '-')) %>%
filter(director != actor) %>%
group_by(collaboration) %>%
summarise(n = n(), revenue = sum(revenue)) %>%
top_n(15, wt = n) %>%
ggplot(aes(x = reorder(collaboration, revenue), y = n)) +
geom_col(fill = wes_palette('Royal1', 1, type = 'discrete')) +
coord_flip() +
geom_text(aes(label = paste(round(revenue/1000000,1), 'M')), y = 0.5, size = 3,
colour = wes_palette('Darjeeling2', 1, type = 'discrete')) +
labs(x = 'Top Collaborations', y = 'Number of Movies')
What about the best movies according to this dataset? First, let's look in more detail at the distribution of votes
movies_db %>%
select(vote_average, vote_count) %>%
summary()
## vote_average vote_count
## Min. : 0.000 Min. : 0.0
## 1st Qu.: 5.600 1st Qu.: 54.0
## Median : 6.200 Median : 235.0
## Mean : 6.092 Mean : 690.2
## 3rd Qu.: 6.800 3rd Qu.: 737.0
## Max. :10.000 Max. :13752.0
Let's filter to the top 50% by vote count, that is, at least 235 votes, and plot the top 20 movies by vote average
movies_db %>%
filter(vote_count > 235) %>%
select(id, title, vote_average) %>%
left_join(genre_1, by = 'id') %>%
top_n(20, wt = vote_average) %>%
ggplot(aes(x = reorder(title, vote_average), y = vote_average, fill = genre)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = wes_palette('Rushmore', 6, type = 'continuous')) +
labs(x = 'Movies', y = 'Vote Avg')
We see here that most of the movies are dramas, though I wouldn't label ‘Psycho’ a drama, since it's a cult horror movie, or ‘American History X’, which I consider more of a crime movie, but that's quite subjective. It looks like drama is the prevailing genre; let's check, this time considering that one movie can have several genres
genres %>%
count(genres) %>%
ggplot(aes(x = reorder(genres, n), y = n)) +
geom_col(fill = wes_palette('Zissou1', 1, type = 'discrete')) +
coord_flip() +
labs(x = 'Genres', y = 'Number of Movies')
Let's look at production companies now. One movie can have several production companies; let's assume they split budget and revenue equally, which is obviously not true, but it still gives a sense of the size of each company. The numbers inside the columns are mean budgets
# Get number of movies by prod companies
prod_comp_n <- prod_comp %>%
count(id)
prod_comp %>%
left_join(movies_db %>% select(id, budget), by = 'id') %>%
left_join(prod_comp_n, by = 'id') %>%
transmute(production_companies, budget = budget/n) %>%
group_by(production_companies) %>%
summarise(n = n(), budget = mean(budget)) %>%
top_n(20, wt = n) %>%
ggplot(aes(x = reorder(production_companies, n), y = n)) +
geom_col(fill = wes_palette('Royal1', 1, type = 'discrete')) +
coord_flip() +
geom_text(aes(label = paste(round(budget/1000000,1), 'M')), y = 20, size = 3,
colour = wes_palette('Darjeeling2', 1, type = 'discrete')) +
labs(x = 'Prod Company', y = 'Number of Movies')
Let's look at mean profit under this assumption. I use plotly for this one since it gives labels on interaction. I filter to companies that have more than 5 movies
prod_comp %>%
left_join(movies_db %>% select(id, budget, revenue), by = 'id') %>%
left_join(prod_comp_n, by = 'id') %>%
transmute(production_companies, budget = budget/n, revenue = revenue/n, profit = revenue-budget) %>%
group_by(production_companies) %>%
summarise(n = n(), budget = mean(budget), revenue = mean(revenue), profit = mean(profit)) %>%
filter(budget != 0, n > 5) %>%
plotly::plot_ly(x = ~profit, y = ~budget, color = ~n, type = 'scatter',
mode = 'markers', text = ~production_companies, size = ~n,
colors = (c(wes_palette('Zissou1', 2, type = 'continuous'))))
Let's check for NAs and duplicated titles
movies_db %>% select(title) %>% is.na() %>% which()
## integer(0)
movies_db %>% select(title) %>% duplicated() %>% which()
## [1] 2878 3694 4268
Since there are just three observations with duplicated titles, let's simply remove them.
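A minimal sketch of the removal with dplyr, shown here on a toy data frame (the titles below are made up for illustration); the corpus-building code further down relies on the same distinct() idiom:

```r
library(dplyr)
library(tibble)

# Toy stand-in for movies_db with one duplicated title (hypothetical rows)
movies <- tibble(
  title = c("Batman", "Batman", "Out of the Blue"),
  vote_average = c(7.0, 6.1, 6.5)
)

# Keep only the first occurrence of each title
movies <- movies %>% distinct(title, .keep_all = TRUE)
nrow(movies)
## [1] 2
```

Note that distinct() keeps the first row it encounters for each title, so the row order of the input matters.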
The first approach uses genres, director and actors to recommend movies. First I remove the spaces within names and paste everything into one df to form the corpus, which is the object that tm requires. Then we compute the frequencies; in this case we use binary weighting, since an actor, director or genre can appear at most once per movie
# group labels
pasted_genres <- genres %>%
mutate(genres = str_replace_all(genres, ' ', '')) %>%
group_by(id) %>%
summarise(genres = paste(genres, collapse = ' ' ))
pasted_directors <- director %>%
mutate(director = str_replace_all(name, ' ', '')) %>%
group_by(movie_id) %>%
summarise(director = first(director))
pasted_cast <- cast %>%
mutate(actors = str_replace_all(name, ' ', '')) %>%
group_by(movie_id) %>%
summarise(actors = paste(actors, collapse = ' ' ))
# Create corpus
corpus_metadata <- movies_db %>%
select(id, title) %>%
distinct(title, .keep_all = TRUE) %>%
left_join(pasted_genres, by = 'id') %>%
left_join(pasted_directors, by = c('id' = 'movie_id')) %>%
left_join(pasted_cast, by = c('id' = 'movie_id')) %>%
transmute(doc_id = title, text = paste(genres, director, actors)) %>%
as.data.frame() %>%
DataframeSource() %>%
Corpus()
# Form dtm with binary weighting
dt_md_bin <- DocumentTermMatrix(corpus_metadata,
control = list(weighting = function(x) weightBin(x)))
# Convert into Matrix
frec_matrix <- dt_md_bin %>% as.matrix()
frec_matrix[1:5, 1:5]
## Terms
## Docs action adventure fantasy
## Avatar 1 1 1
## Pirates of the Caribbean: At World's End 1 1 1
## Spectre 1 1 0
## The Dark Knight Rises 1 0 0
## John Carter 1 1 0
## Terms
## Docs jamescameron michellerodriguez
## Avatar 1 1
## Pirates of the Caribbean: At World's End 0 0
## Spectre 0 0
## The Dark Knight Rises 0 0
## John Carter 0 0
Let's create a function that receives a title and returns recommendations. The function compares the terms of the input title against the whole matrix, attaches each movie's votes, and returns the movies that match the most terms and have the highest votes. Let's see the recommendations for Titanic
recommender <- function(matr, title = 'Avatar', db = movies_db, n = 10){
# matr: matrix of weights
# title: title of the movie (string)
# db: original df, used to extract votes
# n: number of recommendations to return
ind <- which(row.names(matr) == title)
frec_movie <- matr[ind,]
# Add the movie's binary vector to every row; shared terms sum to 2
mutual_terms <- sweep(matr, 2, frec_movie, '+')
# Count shared terms per movie
most_frec <- apply(mutual_terms, 1, function(x) {sum(x == 2)})
# Join with original df to get vote_average
recomms <- data.frame(title = names(most_frec), frec = most_frec) %>%
mutate_if(is.character, factor) %>%
left_join(db %>%
select(title, vote_average) %>%
distinct(title, .keep_all = TRUE) %>%
mutate_if(is.character, factor),
by = 'title') %>%
arrange(desc(frec), desc(vote_average)) %>%
select(title) %>%
head(n)
return(recomms)
}
recommender(frec_matrix, 'Titanic')
| title |
|---|
| Titanic |
| Revolutionary Road |
| The Beach |
| The Departed |
| Eternal Sunshine of the Spotless Mind |
| Shutter Island |
| Pandora’s Box |
| What’s Eating Gilbert Grape |
| The Great Gatsby |
| The Curious Case of Benjamin Button |
I leave the input movie itself among the recommendations to check that the function is working properly. We see that for Titanic we get movies mainly starring Leonardo DiCaprio or Kate Winslet; in fact the first real recommendation, ‘Revolutionary Road’, is a romantic drama starring both.
Let's now use the plot of the movie for the recommendations. Plots are longer texts than names and genre labels, so we need to pre-process the corpus:
Remove punctuation and numbers.
Remove stopwords; these are common words that do not carry real meaning for identifying the plot, e.g. words like the, is, at, which, on, etc.
Stem the words, which consists in reducing each word to its root so that forms with the same meaning but different tense are grouped together, e.g. working, worked and works all become work.
All this can be done easily with the tm package, though normally the pre-processing in text mining requires more attention, like choosing appropriate dictionaries and looking deeply into the corpus structure.
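To preview what the stemming step does, SnowballC (the stemmer that tm's stemDocument delegates to) can be called directly:

```r
library(SnowballC)

# Porter stemming collapses inflected forms onto a common root
wordStem(c("working", "worked", "works", "titanic"))
## [1] "work"  "work"  "work"  "titan"
```

Note the last one: titanic stems to titan, a side effect that will show up in the plot-based recommendations later on.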
Another difference is that we now use Tf-Idf (term frequency-inverse document frequency) weighting, which takes the basic frequency of a term in a document but offsets it by the number of documents in the corpus that contain the term; this down-weights words that appear frequently across the whole corpus and whose general meaning doesn't help to distinguish documents. This blog gives more insights about Tf-Idf
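For reference, this is roughly the quantity tm's weightTfIdf computes by default (a length-normalized term frequency times a log-base-2 inverse document frequency):

```latex
\mathrm{tfidf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \times \log_2 \frac{N}{n_t}
```

where \(f_{t,d}\) is the count of term \(t\) in document \(d\), \(N\) is the number of documents and \(n_t\) is the number of documents containing \(t\); the second factor shrinks toward zero for terms that appear in almost every document.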
# Create corpus
corpus_plot <- movies_db %>%
select(title, overview) %>%
distinct(title, .keep_all = TRUE) %>%
na.omit() %>%
transmute(doc_id = title, text = overview) %>%
as.data.frame() %>%
DataframeSource() %>%
Corpus() %>%
tm_map(removePunctuation) %>%
tm_map(removeNumbers) %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removeWords, stopwords("SMART")) %>%
tm_map(stripWhitespace) %>%
tm_map(stemDocument)
# Form dtm with Tf-Idf weighting
dt_plot_tfidf <- corpus_plot %>%
DocumentTermMatrix(control = list(weighting = function(x) weightTfIdf(x)))
tfidf_matrix <- dt_plot_tfidf %>% as.matrix()
tfidf_matrix[1:5, 1:5]
## Terms
## Docs alien centuri civil
## Avatar 0.4529051 0.4570934 0.5231809
## Pirates of the Caribbean: At World's End 0.0000000 0.0000000 0.0000000
## Spectre 0.0000000 0.0000000 0.0000000
## The Dark Knight Rises 0.0000000 0.0000000 0.0000000
## John Carter 0.0000000 0.0000000 0.0000000
## Terms
## Docs dispatch marin
## Avatar 0.6559366 0.5975526
## Pirates of the Caribbean: At World's End 0.0000000 0.0000000
## Spectre 0.0000000 0.0000000
## The Dark Knight Rises 0.0000000 0.0000000
## John Carter 0.0000000 0.0000000
Now we need a function to get the recommendations, but the one we already wrote only works with binary weights, so we create a new one that computes the cosine similarity between the weights of the input movie and every row of the matrix. A nice explanation of cosine similarity can be found here
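For reference, cosine similarity measures the angle between two weight vectors \(x\) and \(y\), ignoring their magnitudes:

```latex
\mathrm{sim}(x, y) = \cos(\theta)
  = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
  = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}
```

With non-negative Tf-Idf weights it is 1 when two documents have identical term-weight directions and 0 when they share no terms, which is why the input movie always comes back with distance 1.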
# Cosine similarity of one vector with every row of a matrix
# Adapted from an answer at: https://codereview.stackexchange.com/questions/159396/cosine-similarity-of-one-vector-with-many
cosineSim <- function(x, m){
y <- x / sqrt(crossprod(x))
as.vector((m %*% y) / sqrt(rowSums(m^2)))
}
recommender_cosine <- function(matr, title = 'Avatar', n = 10){
# matr: matrix of weights
# title: title of the movie (string)
# n: number of recommendations to return
ind <- which(row.names(matr) == title)
t <- matr[ind,] %>% as.numeric()
cosine_dist_mat <- cosineSim(t, matr)
term_dist <- data.frame(title = row.names(matr), distance = cosine_dist_mat)
recommendations <- term_dist %>%
arrange(desc(distance)) %>%
head(n)
return(recommendations)
}
recommender_cosine(tfidf_matrix, 'Titanic')
| title | distance |
|---|---|
| Titanic | 1.0000000 |
| Raise the Titanic | 0.1534576 |
| Ghost Ship | 0.1260145 |
| Event Horizon | 0.0997641 |
| Niagara | 0.0992658 |
| I Can Do Bad All By Myself | 0.0990588 |
| The Rose | 0.0943069 |
| Pieces of April | 0.0937866 |
| The Ballad of Jack and Rose | 0.0935104 |
| The Greatest | 0.0887989 |
We see for Titanic some recommendations about other ship movies like ‘Raise the Titanic’, ‘Ghost Ship’ and ‘The Chambermaid on the Titanic’. Movies that include the word ‘rose’ also appear, like ‘The Rose’ and ‘The Ballad of Jack and Rose’; in fact the main character of ‘Niagara’ is called Rose, so the engine is matching character names with that of Kate Winslet's character. Let's see the words that appear in the ‘Titanic’ plot
corpus_plot[['Titanic']]$content
## [1] "year yearold woman name rose dewitt bukat tell stori granddaught lizzi calvert brock lovett lewi bodin bobbi buell anatoli mikailavich keldysh life set april ship call titan young rose board depart ship upperclass passeng mother ruth dewitt bukat fiancé caledon hockley drifter artist name jack dawson friend fabrizio de rossi win thirdclass ticket ship game explain stori departur death titan voyag april morn"
We see the word april; that's probably why one of the recommendations is ‘Pieces of April’. Another artifact is that stemming turns the word titanic into titan, and that's why we see ‘Wrath of the Titans’, which has nothing in common with Titanic. Obviously I need to focus on a better pre-process, as noted previously.
Finally, let's try a combined system. Since we saw better results with metadata, we will put more weight on director and actors by pasting them twice. This system uses a Tf (term frequency) weighting, since it doesn't make much sense to offset actors, directors or genres by document frequency
corpus_combined <- movies_db %>%
select(id, title, overview) %>%
distinct(title, .keep_all = TRUE) %>%
left_join(pasted_genres, by = 'id') %>%
left_join(pasted_directors, by = c('id' = 'movie_id')) %>%
left_join(pasted_cast, by = c('id' = 'movie_id')) %>%
transmute(doc_id = title, text = paste(genres, director, director, actors, actors, overview)) %>%
as.data.frame() %>%
DataframeSource() %>%
Corpus() %>%
tm_map(removePunctuation) %>%
tm_map(removeNumbers) %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removeWords, stopwords("SMART")) %>%
tm_map(stripWhitespace) %>%
tm_map(stemDocument)
dt_combined_tf <- corpus_combined %>%
DocumentTermMatrix(control = list(weighting = function(x) weightTf(x)))
tf_combined_matrix <- dt_combined_tf %>% as.matrix()
recommender_cosine(tf_combined_matrix, 'Titanic')
| title | distance |
|---|---|
| Titanic | 1.0000000 |
| Revolutionary Road | 0.1645461 |
| The Believer | 0.1500625 |
| Romance & Cigarettes | 0.1373786 |
| Ghost Ship | 0.1349008 |
| Iris | 0.1314604 |
| 11:14 | 0.1303436 |
| What’s Eating Gilbert Grape | 0.1295218 |
| Rent | 0.1254089 |
| The Four Feathers | 0.1246984 |
We see that it achieves a combination of both systems. Let's write a function to compare the three engines side by side
compare_engines <- function(.title = 'Avatar'){
syst_1 <- recommender(frec_matrix, title = .title)
syst_2 <- recommender_cosine(tfidf_matrix, title = .title) %>% select(title)
syst_3 <- recommender_cosine(tf_combined_matrix, title = .title) %>% select(title)
df <- cbind(syst_1, syst_2, syst_3) %>% setNames(c('metadata', 'plot', 'combined'))
return(df)
}
compare_engines('Fight Club')
| metadata | plot | combined |
|---|---|---|
| Fight Club | Fight Club | Fight Club |
| The Curious Case of Benjamin Button | Rudderless | The Curious Case of Benjamin Button |
| Panic Room | UHF | Panic Room |
| American History X | Ghost World | Se7en |
| Se7en | The Invention of Lying | The Painted Veil |
| Inglourious Basterds | Slither | Requiem for a Dream |
| Gone Girl | R100 | Keeping the Faith |
| Dallas Buyers Club | This Is Martin Bonner | The Big Short |
| Requiem for a Dream | Step Up 2: The Streets | The Heart of Me |
| The Heart of Me | After | The Illusionist |
compare_engines('Memento')
| metadata | plot | combined |
|---|---|---|
| Memento | Memento | Memento |
| Inception | Amnesiac | Amnesiac |
| The Prestige | When Did You Last See Your Father? | Red Planet |
| L.A. Confidential | Dark City | The Fugitive |
| The Fugitive | Getaway | Insomnia |
| Insomnia | Cat on a Hot Tin Roof | The Prestige |
| Disturbia | Rebecca | Buried |
| Buried | Gothika | The Butterfly Effect |
| Silent Hill: Revelation 3D | Eternal Sunshine of the Spotless Mind | L.A. Confidential |
| The Dark Knight | The Butterfly Effect | Ravenous |
After playing around for a while, I think the engines work with some limitations, especially the plot-only one. The combined system works pretty well, considering it gives recommendations similar to the metadata-based one, which additionally uses vote average
The last part of the kernel consists in applying LDA (Latent Dirichlet Allocation) as a dimensionality-reduction method to see if we can get results similar to the previous ones.
LDA is a probabilistic model that allows observations to be explained by unobserved groups. It basically categorizes documents into topics, while the topics themselves are formed by words. The cool thing about LDA is that it lets a document be composed of several topics, just as words can contribute to more than one topic. The downside is that parameters like alpha, beta and the number of topics must be set from the beginning, and they are hard to choose. A really nice explanation can be found here
dt_plot_tf <- corpus_plot %>%
DocumentTermMatrix(control = list(weighting = function(x) weightTf(x)))
lda_tf <- LDA(dt_plot_tf, k = 5,
control = list(estimate.alpha = FALSE, alpha = 5,
seed = 1234))
# code by: https://www.tidytextmining.com/topicmodeling.html
tidy(lda_tf, matrix = "beta") %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta) %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~topic, scales = "free") +
coord_flip() +
scale_fill_manual(values = wes_palette('Moonrise2', 5, type = 'continuous'))
Above are the main words that represent each topic; we can see it's difficult to match a topic with a genre. We probably need to focus on the pre-processing, like grouping synonyms and removing words with poor contribution. Either way, let's see the recommendations
lda_documents <- tidy(lda_tf, matrix = "gamma")
df <- lda_documents %>%
spread(topic, gamma)
tmatrix <- as.matrix(df[,-1])
rownames(tmatrix) <- df$document
recommender_cosine(tmatrix, 'Titanic')
| title | distance |
|---|---|
| Titanic | 1.0000000 |
| The Book of Mormon Movie, Volume 1: The Journey | 0.9988538 |
| Youth in Revolt | 0.9988155 |
| A Turtle’s Tale: Sammy’s Adventures | 0.9985572 |
| Hud | 0.9984844 |
| The Homesman | 0.9981218 |
| The Crocodile Hunter: Collision Course | 0.9976165 |
| Urbania | 0.9972907 |
| The Ghastly Love of Johnny X | 0.9964940 |
| A Thin Line Between Love and Hate | 0.9962552 |
As expected, the model provides meaningless recommendations; hopefully I will post an update with better results.
Thanks for reading my first kernel, I hope you found it useful!